This document was created for exercise and reference purposes. Some plots could also be made with qqplot, but I avoided that here in order to learn ggplot. If you find an error, or if you have feedback on how to improve the code or the analysis, please contact me!
library(foreign)
library(ggplot2)
library(GGally)
library(gvlma)
library(reshape2)
library(car)
library(lm.beta)
setwd('/Users/henrikeckermann/Documents/workspace/Website/mysite/main/templates/Blok 1/Multivariate Statistics/Week 2/HW2')
data_full <- read.spss('REGRESS.SAV',to.data.frame = T)
re-encoding from CP1252
data <- data_full
data$rn <- rownames(data)
write.csv(data_full, file = 'REGRESSION.csv')
head(data)
str(data)
'data.frame': 465 obs. of 6 variables:
$ id : num 1 2 3 4 5 6 7 8 9 10 ...
$ timedrs: num 1 3 0 13 15 3 2 0 7 4 ...
$ phyheal: num 5 4 3 2 3 5 5 4 5 3 ...
$ menheal: num 8 6 4 2 6 5 6 5 4 9 ...
$ stress : num 265 415 92 241 86 247 13 12 269 391 ...
$ rn : chr "1" "2" "3" "4" ...
- attr(*, "variable.labels")= Named chr "Participant number" "Visits to health professionals" "Physical health symptoms" "Mental health symptoms" ...
..- attr(*, "names")= chr "id" "timedrs" "phyheal" "menheal" ...
- attr(*, "codepage")= int 1252
ggpairs(data[,1:4], aes(alpha=0.1),lower = list(continuous = "smooth"))
1. The relationship between the response variable (Y) and the predictors \(X_i\) is not linear.
Inspect each \(X_i\) against the residuals where possible, and always inspect the fitted values against the residuals. The residuals should be randomly distributed.
2. Heteroscedasticity: non-constant variance of the residuals across the range of each predictor.
Look at the residuals. Heteroscedasticity shows when the SD of the residuals increases with increasing Y, resulting in a 'funnel-like' appearance.
3. Residuals not normally distributed.
Easy to inspect visually when plotting the residuals: they should be roughly centered on a horizontal line and spread symmetrically around it. Also use a QQ-plot.
4. Independence of the errors.
Imagine that the distance subjects live from a toxic substance that influences the measured variables differs between subjects. In that case, the error depends on that distance. The Durbin-Watson statistic is a measure of independence of the errors; it should lie between 1 and 3, optimally at 2. For time-series data (or the distance example): plot the errors as a function of time (distance) and look for patterns. Generally, independence of the errors is achieved through a solid experimental design.
5. Absence of multicollinearity:
There must not be high correlations between the predictors. Use the variance inflation factor as shown below.
6. Multivariate outliers.
Assumptions that are not covered:
a) All predictors that are reasonable based on previous research or theory are included in the model (not further tested or explained here).
b) The reliability of the measures should be 1 (a reasonable cutoff is 0.7). This is not covered here.
# colors are set outside aes() so they are used literally instead of creating a legend
ggplot(data, aes(x = phyheal , y = timedrs)) +
geom_point(alpha = 0.4) +
geom_smooth(method = 'lm', color = 'red', alpha = 0.4) +
geom_smooth(color = 'skyblue')
ggplot(data, aes(x = menheal , y = timedrs)) +
geom_point(alpha = 0.4) +
geom_smooth(method = 'lm', color = 'red', alpha = 0.4) +
geom_smooth(color = 'skyblue')
ggplot(data, aes(x = stress , y = timedrs)) +
geom_point(alpha = 0.4) +
geom_smooth(method = 'lm', color = 'red', alpha = 0.4) +
geom_smooth(color = 'skyblue')
fit <- lm(timedrs ~ phyheal + menheal + stress, data)
data$residuals <- resid(fit)
data$residuals.standard <- rstandard(fit)
data$residuals.studentized <- rstudent(fit)
data$fitted <- fitted.values(fit)
summary(fit)
Call:
lm(formula = timedrs ~ phyheal + menheal + stress, data = data)
Residuals:
Min 1Q Median 3Q Max
-14.792 -4.353 -1.815 0.902 65.886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.704848 1.124195 -3.296 0.001058 **
phyheal 1.786948 0.221074 8.083 5.6e-15 ***
menheal -0.009666 0.129029 -0.075 0.940318
stress 0.013615 0.003612 3.769 0.000185 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.708 on 461 degrees of freedom
Multiple R-squared: 0.2188, Adjusted R-squared: 0.2137
F-statistic: 43.03 on 3 and 461 DF, p-value: < 2.2e-16
We should see randomly distributed datapoints in these plots:
ggplot(data, aes(x=phyheal, y=residuals.standard)) +
geom_point(alpha=0.4) +
geom_smooth()
ggplot(data, aes(x=stress, y=residuals.standard)) +
geom_point(alpha=0.4) +
geom_smooth()
ggplot(data, aes(x=menheal, y=residuals.standard)) +
geom_point(alpha=0.4) +
geom_smooth()
ggplot(data, aes(x=fitted, y=timedrs)) +
geom_point(alpha=0.4) +
geom_smooth(se=F, color='green') +
geom_smooth(method='lm', color='red')
ggplot(data, aes(x=fitted, y=residuals)) +
geom_point(alpha=0.4) +
geom_smooth(se=F, color='green') +
geom_smooth(method='lm', color='red')
We check these assumptions by inspecting 3 graphs.
data$fitted <- fit$fitted.values
p.hs3 <- ggplot(data, aes(x = residuals.standard)) +
geom_histogram(aes(y = ..density..), color = 'black', fill = 'white') +
labs(x = 'Standardized Residuals', y = 'Density') +
stat_function(fun = dnorm, args = list(mean = mean(data$residuals.standard), sd = sd(data$residuals.standard)), color = 'red', size = 1)
p.hs3
p.hs2 <- ggplot(data, aes(sample = residuals.standard)) +
stat_qq()
p.hs2
p.hs <- ggplot(data, aes(x = fitted, y= residuals.standard)) +
geom_point(alpha = 0.4, color = 'red') +
geom_smooth(method = 'lm', color = 'black') +
geom_smooth(color = 'yellow', se = F) +
labs(x = 'Fitted Values', y = 'Standardized Residual')
p.hs
Conservative rule for the Durbin-Watson test: values should lie between 1 and 3, and the closer to 2, the better.
durbinWatsonTest(fit)
lag Autocorrelation D-W Statistic p-value
1 -0.004467856 2.007136 0.998
Alternative hypothesis: rho != 0
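The statistic that `durbinWatsonTest` reports can also be computed by hand. A minimal sketch (the `dw` helper is illustrative, not part of car; car additionally bootstraps a p-value):

```r
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)
dw(rnorm(465))           # independent errors give a value near 2
dw(cumsum(rnorm(465)))   # strongly autocorrelated errors give a value near 0
```

Applied to the model residuals, `dw(resid(fit))` reproduces the 2.007 above.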
vif(fit)
phyheal menheal stress
1.372358 1.441328 1.184410
1/vif(fit)
phyheal menheal stress
0.7286726 0.6938048 0.8443025
mean(vif(fit))
[1] 1.332699
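The variance inflation factor for predictor j is \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor j on the remaining predictors. A sketch on simulated data (the variable names `x1`, `x2` are illustrative; applying the same idea to the homework data, `summary(lm(phyheal ~ menheal + stress, data))$r.squared`, reproduces `vif(fit)["phyheal"]` = 1.372):

```r
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.5 * x1 + rnorm(200)        # x2 is correlated with x1
r2.x1 <- summary(lm(x1 ~ x2))$r.squared
vif.x1 <- 1 / (1 - r2.x1)          # by-hand VIF for x1
vif.x1                             # > 1 because x1 and x2 are correlated
```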
Here we see the standardized and studentized residuals plotted against the leverage. 14 data points are multivariate outliers, and 3 of them also have more than 2x the average leverage. There is one point with relatively high leverage and high deviation, which potentially has a high influence on our model. Overall, 3.7% of the residuals are more than 2.5 SD away from the mean, which indicates a poor fit of the model.
mean(data$residuals.standard > 2 | data$residuals.standard < -2) # are 95% within +/- 2?
[1] 0.04516129
mean(data$residuals.standard > 2.5 | data$residuals.standard < -2.5) # are 99% within +/- 2.5?
[1] 0.03655914
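For reference, these observed proportions can be compared against what a standard normal distribution predicts for the two cutoffs:

```r
# Expected proportion of standardized residuals beyond the cutoffs
# if the residuals were exactly standard normal:
2 * pnorm(-2)     # ~0.046, so ~5% outside +/- 2 is expected
2 * pnorm(-2.5)   # ~0.012, so ~1% outside +/- 2.5 is expected
```

The observed 4.5% and 3.7% show that the +/-2 band is about as expected, but the tails beyond 2.5 SD are too heavy.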
data$cooks.distance <- cooks.distance(fit)
data$leverage <- hatvalues(fit)
influence <- data[(data$cooks.distance>0.05 & data$leverage > 2*mean(data$leverage)),]
p.o <- ggplot(data, aes(x = leverage, y = residuals.standard)) +
geom_hline(yintercept = 3, linetype = 'dotted',alpha=0.4) +
geom_hline(yintercept = 2, linetype = 'dotted', alpha=0.4) +
geom_vline(xintercept = 2* mean(data$leverage), linetype = 'dotted',alpha=0.4) +
geom_point(alpha = 0.2) +
geom_point(data = influence, aes(x = leverage, y = residuals.standard), color = 'red') +
geom_text(data = influence, aes(label =rn), vjust = 2)
p.o.rst <- ggplot(data, aes(x = leverage, y = residuals.studentized)) +
geom_hline(yintercept = 3, linetype = 'dotted',alpha=0.4) +
geom_hline(yintercept = 2, linetype = 'dotted', alpha=0.4) +
geom_vline(xintercept = 2* mean(data$leverage), linetype = 'dotted',alpha=0.4) +
geom_point(alpha = 0.2) +
geom_point(data = influence, aes(x = leverage, y = residuals.studentized), color = 'red') +
geom_text(data = influence, aes(label =rn), vjust = 2)
p.o
p.o.rst
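The 2 x mean-leverage cutoff drawn in these plots has a closed form, because the hat values of a linear model always sum to the number of estimated coefficients:

```r
# mean(hatvalues(fit)) = (k + 1) / n, since hat values sum to the number
# of estimated coefficients (k predictors + intercept)
k <- 3      # phyheal, menheal, stress
n <- 465    # observations in the data set
2 * (k + 1) / n   # ~0.0172, the vertical cutoff used above
```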
plot(fit)
gvlma(fit)
Call:
lm(formula = timedrs ~ phyheal + menheal + stress, data = data)
Coefficients:
(Intercept) phyheal menheal stress
-3.704848 1.786948 -0.009666 0.013615
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = fit)
Value p-value Decision
Global Stat 4901.2493 0.0000 Assumptions NOT satisfied!
Skewness 862.7720 0.0000 Assumptions NOT satisfied!
Kurtosis 4036.5033 0.0000 Assumptions NOT satisfied!
Link Function 0.9991 0.3175 Assumptions acceptable.
Heteroscedasticity 0.9749 0.3235 Assumptions acceptable.
Log transform the dependent variable:
data$timedrs.log <- log((data$timedrs+1))
ggplot(data, aes(x=timedrs.log)) +
geom_histogram(aes(y = ..density..), col = 'black', fill = 'white') +
stat_function(fun = dnorm, color = 'red', args = list(mean(data$timedrs.log), sd = sd(data$timedrs.log)), size = 1)
ggplot(data, aes(sample = timedrs.log)) +
stat_qq()
But according to the Shapiro-Wilk test, the log-transformed variable timedrs, W = 0.976, p < .001, is still significantly non-normal.
shapiro.test(data$timedrs.log)
Shapiro-Wilk normality test
data: data$timedrs.log
W = 0.97608, p-value = 6.492e-07
ggplot(data, aes(x = phyheal , y = timedrs.log)) +
geom_point(alpha = 0.4) +
geom_smooth(col = 'red', method = 'lm', alpha = 0.4, se = F) +
geom_smooth(col = 'skyblue')
ggplot(data, aes(x = menheal , y = timedrs.log)) +
geom_point(alpha = 0.4) +
geom_smooth(col = 'red', method = 'lm', alpha = 0.4, se = F) +
geom_smooth(col = 'skyblue')
ggplot(data, aes(x = stress , y = timedrs.log)) +
geom_point(alpha = 0.4) +
geom_smooth(col = 'red', method = 'lm', alpha = 0.4) +
geom_smooth(col = 'skyblue')
# fit the regression with the log-transformed outcome
fit.log <- lm(timedrs.log ~ phyheal + menheal + stress, data)
data$log.leverage <- hatvalues(fit.log)
data$log.residuals.standard <- rstandard(fit.log)
data$log.residuals <- resid(fit.log)
data$log.cooks.distance <- cooks.distance(fit.log)
data$log.residuals.studentized <- rstudent(fit.log)
mean(data$log.residuals.standard > 2 | data$log.residuals.standard < -2) # are 95% within +/- 2?
[1] 0.06451613
mean(data$log.residuals.standard > 2.5 | data$log.residuals.standard < -2.5) # are 99% within +/- 2.5?
[1] 0.02580645
There remains one multivariate outlier with low leverage. The most influential data point according to Cook's D has a Cook's D lower than 0.08 and is therefore not considered problematic.
fara <- data[data$log.residuals.standard>=3,]
log.influence <- data[(data$log.cooks.distance>0.07 & data$log.leverage > 2*mean(data$log.leverage)),]
p.log.o <- ggplot(data, aes(x = log.leverage, y = log.residuals.studentized)) +
geom_hline(yintercept = 3, linetype = 'dotted') +
geom_hline(yintercept = 2, linetype = 'dotted') +
geom_hline(yintercept = -2, linetype = 'dotted') +
geom_vline(xintercept = 2* mean(data$log.leverage), linetype = 'dotted') +
geom_point(alpha = 0.2) +
geom_point(data = log.influence, aes(x = log.leverage, y = log.residuals.studentized), color = 'red') +
geom_text(data = log.influence, aes(label =rn), vjust = 2)+
geom_text(data = fara, aes(label =rn), hjust = 1.2)
p.log.o
durbinWatsonTest(fit.log)
lag Autocorrelation D-W Statistic p-value
1 -0.03493742 2.065067 0.446
Alternative hypothesis: rho != 0
vif(fit.log)
phyheal menheal stress
1.372358 1.441328 1.184410
1/vif(fit.log)
phyheal menheal stress
0.7286726 0.6938048 0.8443025
mean(vif(fit.log))
[1] 1.332699
data$log.fitted <- fit.log$fitted.values
p.log.hs3 <- ggplot(data, aes(x = log.residuals.standard)) +
geom_histogram(aes(y = ..density..), color = 'black', fill = 'white') +
labs(x = 'Log Standardized Residual', y = 'Density') +
stat_function(fun = dnorm, args = list(mean = mean(data$log.residuals.standard), sd = sd(data$log.residuals.standard)), color = 'red', size = 1)
p.log.hs3
p.log.hs2 <- ggplot(data, aes(sample = log.residuals.standard)) +
stat_qq()
p.log.hs2
p.log.hs <- ggplot(data, aes(x = fitted, y= log.residuals.standard)) +
geom_point(alpha = 0.4, color = 'red') +
geom_smooth(method = 'lm', color = 'black') +
geom_smooth(color = 'yellow', se = F) +
labs(x = 'Fitted Values', y = 'Log Standardized Residual')
p.log.hs
data$sstress <- sqrt(data$stress)
data$lphyheal <- log(data$phyheal)
data$smenheal <- sqrt(data$menheal)
fit.trans <- lm(timedrs.log ~ lphyheal + smenheal + sstress, data)
data$tleverage <- hatvalues(fit.trans)
data$tcook <- cooks.distance(fit.trans)
data$tstresid <- rstandard(fit.trans)
data$tfitted <- fit.trans$fitted.values
tinfluence <- data[(data$tcook >0.05 & data$tleverage > 2*mean(data$tleverage)),]
tfara <- data[data$tstresid>=3,]
p.trans <- ggplot(data, aes(x = tfitted, y= tstresid)) +
geom_point(alpha = 0.4, color = 'red') +
geom_smooth(method = 'lm', color = 'black') +
geom_smooth(color = 'yellow', se = F) +
labs(x = 'Fitted Values', y = 'Transformed - Standardized Residual')
p.trans.two <- ggplot(data, aes(x = tleverage, y = tstresid)) +
geom_hline(yintercept = 3, linetype = 'dotted') +
geom_hline(yintercept = 2, linetype = 'dotted') +
geom_hline(yintercept = -2, linetype = 'dotted') +
geom_vline(xintercept = 2* mean(data$tleverage), linetype = 'dotted') +
geom_point(alpha = 0.2) +
geom_point(data = tinfluence, aes(x = tleverage, y = tstresid), color = 'red') +
geom_text(data = tinfluence, aes(label =rn), vjust = 2)+
geom_text(data = tfara, aes(label = rn), hjust = 1.2)
#Independence of Error
durbinWatsonTest(fit.trans)
lag Autocorrelation D-W Statistic p-value
1 -0.03398472 2.061504 0.494
Alternative hypothesis: rho != 0
#Multicollinearity
vif(fit.trans)
lphyheal smenheal sstress
1.376151 1.447262 1.198400
1/vif(fit.trans)
lphyheal smenheal sstress
0.7266644 0.6909599 0.8344462
mean(vif(fit.trans))
[1] 1.340604
# Outlier
mean(data$tstresid > 2 | data$tstresid < -2) # are 95% within +/- 2?
[1] 0.07096774
mean(data$tstresid > 2.5 | data$tstresid < -2.5) # are 99% within +/- 2.5?
[1] 0.02580645
p.trans
p.trans.two
1. Does the analysis violate any statistical assumptions? Why or why not?
The studentized and standardized residuals do not look evenly and symmetrically distributed.
p.hs3
The QQ-plot confirms a clear deviation from normality of the standardized residuals, thus violating the assumption of normally distributed residuals.
p.hs2
Finally, the fitted values plotted against the standardized residuals should show randomly, normally distributed dots around the horizontal line. In the plot, however, many dots are concentrated on the left near the line, the spread is not symmetric, and the standard deviation increases across the x-axis, clearly indicating heteroscedasticity. Thus, the assumption of homoscedasticity is violated, too.
p.hs
2. Does this analysis violate any statistical assumptions?
There are no violations anymore.
3. Do the residuals look normal, linear, and homoscedastic? Why or why not? That is, describe what features of the scatterplot between the predicted and residual values indicate each assumption is either met or violated.
The residuals look normal because the QQ-Plot shows a line with no major deviations and the distribution shown by the histogram also looks normal.
p.log.hs3
p.log.hs2
The residuals also look symmetrically distributed around the horizontal line (indicating a normal distribution). A violation of normality would show as dots distributed asymmetrically or concentrated further away from the horizontal line. There is only a mild curvilinear trend, so the assumption of linearity is not violated. The data points are relatively evenly and randomly spread, indicating homoscedasticity. Heteroscedasticity would be obvious from an increasing standard deviation across the x-axis, leading to the typical 'funnel' look.
p.log.hs
4. How many multivariate outliers are there?
There is one outlier (id 479, row number 345) if the criterion is +/- 3 SD of the residuals. However, according to Cook's distance, it has less influence than the observation with id 405, and even that observation does not reach a critical value. Still, 6.45%/2.6% of the data are more than 2 SD/2.5 SD away from the mean, indicating a poor fit.
p.log.o
OPTIONAL: Perform another regression analysis, but this time use the transformed measures of all variables. Does the inclusion of transformed predictors further reduce violations of model assumptions?
There remains one multivariate outlier (345), which still does not have the highest influence (and no other observation has too much influence). However, the percentage of standardized residuals beyond +/- 2 increased to ~7%, indicating a poor fit of the linear model. The assumptions of normality, linearity, and homoscedasticity are not violated. See the plots for a visual summary:
p.trans
p.trans.two
Use the results of the regression that included the transformed score of TIMEDRS to answer the following questions:
5. What is the best predictor of TIMEDRS (i.e., largest beta)?
As seen in the output: PHYHEAL is the best predictor.
fit.log.beta <- lm.beta(fit.log)
summary(fit.log.beta)
Call:
lm(formula = timedrs.log ~ phyheal + menheal + stress, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.95865 -0.44076 -0.02331 0.42304 2.36797
Coefficients:
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) 0.3903862 0.0000000 0.0882908 4.422 1.22e-05 ***
phyheal 0.2019361 0.5043975 0.0173624 11.631 < 2e-16 ***
menheal 0.0071442 0.0313336 0.0101335 0.705 0.481
stress 0.0013158 0.1868692 0.0002837 4.638 4.58e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7625 on 461 degrees of freedom
Multiple R-squared: 0.3682, Adjusted R-squared: 0.3641
F-statistic: 89.56 on 3 and 461 DF, p-value: < 2.2e-16
6. How much of the variance of TIMEDRS is explained by the three predictors?
\(R^2 = 0.3682\)
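This value can be reproduced from the sums of squares (assuming `fit.log` is the model with the log-transformed outcome, as used above):

```r
# R^2 = 1 - SS_residual / SS_total
ss.res <- sum(resid(fit.log)^2)
ss.tot <- sum((data$timedrs.log - mean(data$timedrs.log))^2)
1 - ss.res / ss.tot   # reproduces Multiple R-squared = 0.3682
```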
7. Complete the table below. Be sure to include all values in the table and in the table note (i.e., the n and R2).
| Predictor | b | SE | \(\beta\) | p |
|---|---|---|---|---|
| Stressful life events | 0.001 | 0.0002 | 0.187 | < 0.001 |
| Mental health problems | 0.007 | 0.0101 | 0.031 | 0.481 |
| Physical health problems | 0.202 | 0.017 | 0.504 | < 0.001 |
Note: n = 465, Adjusted \(R^2\) = 0.36
8. What is the value of the intercept?
0.3904
9. Describe the results of this regression analysis in APA style. Be sure to describe the analysis, the amount of explained variance, and the magnitude and direction of statistically significant associations.
| Predictor | B | SE B | \(\beta\) | p |
|---|---|---|---|---|
| Constant | 0.39 | 0.09 | | < 0.001 |
| Stressful life events | \(0.001^{***}\) | < 0.001 | 0.187 | < 0.001 |
| Mental health problems | 0.007 | 0.01 | 0.031 | 0.481 |
| Physical health problems | \(0.20^{***}\) | 0.02 | 0.504 | < 0.001 |
Note: n = 465, Adjusted \(R^2\) = 0.36
A multiple regression was conducted to predict the (log-transformed) number of visits to health professionals from stressful life events, mental health symptoms, and physical health symptoms. The model explained a significant proportion of the variance, \(R^2 = .37\), F(3, 461) = 89.56, p < .001. Physical health symptoms (\(\beta\) = .50, p < .001) and stressful life events (\(\beta\) = .19, p < .001) were positively associated with the number of visits, whereas mental health symptoms did not contribute significantly (\(\beta\) = .03, p = .481).
There are three types of linear (and logistic) regression. Standard regression (SPSS's default) enters all predictors simultaneously (i.e., in a single block).
Hierarchical regression enters predictors in different blocks to determine the unique contributions of variables of interest (i.e., the change in R-square). Statistical regression uses algorithms for entering and removing predictors to improve the predictive value of the model; that is, SPSS decides which predictors to include in and exclude from the regression model. Statistical regression is performed just like standard multiple regression, except that the METHOD is different: for statistical regression the methods are FORWARD, BACKWARD, and STEPWISE (for standard regression the method is ENTER).
10. Standard regression has been used and reported thus far. Describe situations when hierarchical and statistical regression would be more appropriate than using a standard regression.
This depends on the research question. Hierarchical regression would be more appropriate to answer the question: Does (e.g.) \(X_1\) significantly add to prediction of Y after differences among subjects in \(X_2\) and \(X_3\) have been statistically eliminated?
Statistical regression would be more appropriate to answer: What is the best linear combination of IVs to predict the DV in this sample?
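In R (rather than SPSS's block/METHOD interface), the two approaches could be sketched like this; the `fit.block1`/`fit.block2` names are illustrative, not part of the analysis above, and note that R's `step()` selects by AIC whereas SPSS's FORWARD/BACKWARD/STEPWISE use p-value criteria:

```r
# Hierarchical: does stress add prediction beyond the health variables?
fit.block1 <- lm(timedrs.log ~ phyheal + menheal, data)
fit.block2 <- lm(timedrs.log ~ phyheal + menheal + stress, data)
anova(fit.block1, fit.block2)   # F-test on the change in R-squared

# Statistical (stepwise): let an algorithm choose the predictors
step(fit.block2, direction = 'backward')
```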